Topical and Lexical Similarity
Much criticism of Donald Trump has centered on the claim that he is unfit for the Office of President of the United States, or on his “unpresidentiality”.
Some of this criticism stems from Donald Trump’s rhetorical style, which has likewise been deemed “unpresidential”.
But what does “presidentiality” mean? Are there traits, character qualities, rhetorical styles, or other elements common to US Presidents?
Have US Presidents throughout history given similar speeches and official addresses to each other?
What have been the most common topics of their speeches, and how have those topics changed over time?
The Miller Center at the University of Virginia’s ‘Presidential Speeches’ collection makes speeches from George Washington to the present available in text format
Most speeches are official addresses, remarks, or statements
Available to the public as a free download in JSON format
Collection is not exhaustive, but it is extensive, containing over 1,000 speeches
Temporal shifts in American society; Realignment in American domestic/foreign policy
Only Presidents from the 20th Century onward, beginning with Theodore Roosevelt (1901)
Only speeches given while in office (no campaign or other speeches), for consistency
Captures the historical trend of topics still relevant today while excluding archaic topics (slavery, railroads, etc.)
Minimal cleaning to maintain semantic and contextual coherence for my models
BERTopic groups similar speeches by their language patterns, automatically identifying key themes with advanced language models
Looks at context of words in relation to each other to find and build common topics
Measures how similar speeches are by comparing them in vector space
Uses word embeddings to capture semantic meaning, not just keywords
Helps identify subtle language patterns and thematic connections
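The vector-space comparison described above boils down to cosine similarity between embedding vectors. A minimal pure-Python sketch with toy vectors (real sentence embeddings have hundreds of dimensions; the numbers below are illustrative only):

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity: dot(u, v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" of two speeches
speech_a = [0.9, 0.1, 0.3]
speech_b = [0.8, 0.2, 0.4]
print(round(cosine_similarity(speech_a, speech_b), 3))  # near 1.0: similar direction
```

Two speeches with embeddings pointing in nearly the same direction score close to 1, regardless of speech length, which is why cosine similarity rather than raw distance is the standard comparison.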
Compares speeches based on word frequency adjusted by overall rarity
Effective for spotting shared vocabulary across texts
Less suited than the embedding strategy for capturing semantic meaning
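The "word frequency adjusted by overall rarity" weighting above can be sketched in pure Python. The three-document corpus and the `tf_idf` helper are illustrative only; scikit-learn's TfidfVectorizer applies a smoothed variant of the same idea:

```python
import math

docs = [
    "economy jobs growth",
    "economy war peace",
    "war war treaty",
]

def tf_idf(term, doc, corpus):
    # Term frequency: raw count of the term in this document
    tf = doc.split().count(term)
    # Inverse document frequency: terms rare across the corpus score higher
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df)
    return tf * idf

# "economy" appears in 2 of 3 docs -> low idf; "jobs" in only 1 -> high idf
print(tf_idf("economy", docs[0], docs))  # log(3/2) ≈ 0.405
print(tf_idf("jobs", docs[0], docs))     # log(3/1) ≈ 1.099
```

Common words like "economy" are down-weighted, so the similarity comparison is driven by each President's more distinctive vocabulary.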
Firstly, that the proportion of topic prevalence in Presidential speeches is time dependent – it fluctuates in accordance with changes in the political spheres, either global or domestic.
Secondly, in both content and vocabulary, there is a similarity between Presidents from Coolidge until Clinton, after which there is a break and a new set of similarities begins.
Thirdly and lastly, while Donald Trump’s rhetorical choices have been criticized as the great break from previous Presidents, this is only verifiable in terms of topic/thematic consistency and not in terms of vocabulary.
This is a technical appendix for the operations performed to create this memo.
import pandas as pd  # df is assumed to have been loaded earlier from the Miller Center JSON

keep_presidents = [
"Theodore Roosevelt", "William Taft", "Woodrow Wilson",
"Warren G. Harding", "Calvin Coolidge", "Herbert Hoover",
"Franklin D. Roosevelt", "Harry S. Truman", "Dwight D. Eisenhower",
"John F. Kennedy", "Lyndon B. Johnson", "Richard M. Nixon",
"Gerald Ford", "Jimmy Carter", "Ronald Reagan",
"George H. W. Bush", "Bill Clinton", "George W. Bush",
"Barack Obama", "Donald Trump", "Joe Biden"
]
df_new = df[df["president"].isin(keep_presidents)].copy()  # copy to avoid SettingWithCopyWarning
df_new = df_new.drop(['doc_name', 'title'], axis=1)
df_new.sort_values("date", inplace=True)
df_new.reset_index(drop=True, inplace=True)

president_terms = {
"Theodore Roosevelt": ("1901-09-14", "1909-03-04"),
"William Taft": ("1909-03-04", "1913-03-04"),
"Woodrow Wilson": ("1913-03-04", "1921-03-04"),
"Warren G. Harding": ("1921-03-04", "1923-08-02"),
"Calvin Coolidge": ("1923-08-02", "1929-03-04"),
"Herbert Hoover": ("1929-03-04", "1933-03-04"),
"Franklin D. Roosevelt": ("1933-03-04", "1945-04-12"),
"Harry S. Truman": ("1945-04-12", "1953-01-20"),
"Dwight D. Eisenhower": ("1953-01-20", "1961-01-20"),
"John F. Kennedy": ("1961-01-20", "1963-11-22"),
"Lyndon B. Johnson": ("1963-11-22", "1969-01-20"),
"Richard M. Nixon": ("1969-01-20", "1974-08-09"),
"Gerald Ford": ("1974-08-09", "1977-01-20"),
"Jimmy Carter": ("1977-01-20", "1981-01-20"),
"Ronald Reagan": ("1981-01-20", "1989-01-20"),
"George H. W. Bush": ("1989-01-20", "1993-01-20"),
"Bill Clinton": ("1993-01-20", "2001-01-20"),
"George W. Bush": ("2001-01-20", "2009-01-20"),
"Barack Obama": ("2009-01-20", "2017-01-20"),
"Donald Trump": ("2017-01-20", "2021-01-20"),
"Joe Biden": ("2021-01-20", "2025-01-20"),
# NOTE: dict keys must be unique; a second "Donald Trump" key would silently
# overwrite the first entry, so the second term gets its own key
# (the second term began on 2025-01-20)
"Donald Trump (2nd Term)": ("2025-01-20", "2025-04-27"),
}

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import contextlib
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
umap_model = UMAP(random_state=42)
#embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
#vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(
#embedding_model=embedding_model,
#vectorizer_model=vectorizer_model,
calculate_probabilities=True,
verbose=False,
umap_model = umap_model,
#top_n_words=7,
#nr_topics="auto",
)
topics, probs = topic_model.fit_transform(docs_chunked)

topic_info = topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]
topic_info_simple = topic_info[["Topic", 'Count', "Representation"]].copy()  # copy to avoid SettingWithCopyWarning
topic_info_simple['Representation'] = topic_info_simple['Representation'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))
frequency_table = pd.DataFrame(topic_info_simple)

chunked_data = []
for idx, row in df_proper.iterrows():
    chunks = chunk_text(row['cleaned_text'], max_words=300)
    for chunk in chunks:
        chunked_data.append({
            "original_speech_id": row['speech_id'],
            "president": row['president'],
            "date": row['date'],
            "transcript": chunk
        })
df_chunked = pd.DataFrame(chunked_data)

import numpy as np  # needed for np.argsort in reassign_topic below

excluded_topics = [-1]
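The chunking step above relies on a `chunk_text` helper that the appendix does not define; a minimal sketch, assuming a simple split into consecutive word-count windows:

```python
def chunk_text(text, max_words=300):
    # Split a transcript into consecutive chunks of at most max_words words
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# A 7-word text with max_words=3 yields chunks of 3, 3, and 1 words
print(chunk_text("one two three four five six seven", max_words=3))
# → ['one two three', 'four five six', 'seven']
```

Chunking keeps each input under the embedding model's effective length, at the cost of assigning several topic labels to one long speech.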
def reassign_topic(topic, prob_row):
    # Move outlier chunks (topic -1) to their highest-probability real topic
    if topic in excluded_topics:
        sorted_indices = np.argsort(prob_row)[::-1]
        for idx in sorted_indices:
            if idx not in excluded_topics:
                return idx
        return topic
    else:
        return topic
df_chunked["topic"] = [
reassign_topic(t, p) for t, p in zip(df_chunked["topic"], probs)
]

df_count_by_year = df_top_5.groupby(['year', 'topic']).size().reset_index(name='count')
df_total_by_year = df_chunked.groupby('year').size().reset_index(name='total')
df_count_by_year = pd.merge(df_count_by_year, df_total_by_year, on='year')
df_count_by_year['proportion'] = df_count_by_year['count'] / df_count_by_year['total']
df_count_by_year['topic_labels'] = df_count_by_year["topic"].map(topic_labels)

library(ggplot2)
count_by_year <- reticulate::py$df_count_by_year
ggplot(count_by_year, aes(x = year, y = proportion, fill = manual_labels)) +
geom_area() +
theme_minimal() +
labs(title = "Top 5 Topics in Presidential Speeches Over Time",
x = "Year",
y = "Proportion of Speeches",
fill = "Topic") +
scale_fill_viridis_d() +
theme(plot.title = element_text(size = 12,face='bold'),
legend.position = "bottom",
legend.text = element_text(size = 6))

ggplot(count_by_year, aes(x = year, y = proportion)) +
geom_line(aes(color = manual_labels)) +
facet_wrap(~ manual_labels, scales = "free_y") +
theme_minimal() +
labs(title = "Top 5 Topics in Presidential Speeches Over Time",
x = "Year",
y = "Proportion of Speeches",
color = "Topic") +
scale_color_viridis_d() +
theme(
legend.position = "none",
plot.title = element_text(size = 10,face='bold')
)

library(reshape2)
library(viridis)
similarity_matrix <- reticulate::py$similarity_df
similarity_matrix <- as.matrix(similarity_matrix)
melted_matrix <- melt(similarity_matrix, varnames = c("president_1", "president_2"))
ggplot(melted_matrix, aes(x = president_1, y = president_2, fill = value)) +
geom_tile(color = "white", linewidth = 0.3) +
scale_fill_viridis(
option = "viridis", # Try "magma", "plasma", or "inferno" for other variants
direction = -1,
limits = c(min(melted_matrix$value), max(melted_matrix$value))
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
axis.text.y = element_text(size = 10),
legend.position = "right",
plot.title = element_text(size = 10,face='bold')
) +
labs(
x = "President",
y = "President",
title = "Word Embedding - Cosine Similarity of Presidential Speeches",
fill = "Similarity"
) +
coord_fixed()

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer  # CountVectorizer is used below
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

presidents_aggregated = df_proper.groupby('president')['cleaned_text'].apply(" ".join).reset_index()
presidents_aggregated['cleaned_text'] = presidents_aggregated['cleaned_text'].apply(remove_stopwords)
vectorizer = CountVectorizer()
president_dfm = vectorizer.fit_transform(presidents_aggregated['cleaned_text'])

similarity_matrix_2 <- reticulate::py$similarity_df_2
similarity_matrix_2 <- as.matrix(similarity_matrix_2)
melted_matrix_2 <- melt(similarity_matrix_2, varnames = c("president_1", "president_2"))
ggplot(melted_matrix_2, aes(x = president_1, y = president_2, fill = value)) +
geom_tile(color = "white", linewidth = 0.3) +
scale_fill_viridis(
option = "viridis", # Try "magma", "plasma", or "inferno" for other variants
direction = -1,
limits = c(min(melted_matrix_2$value), max(melted_matrix_2$value))
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
axis.text.y = element_text(size = 10),
legend.position = "right",
plot.title = element_text(size = 10, face='bold')
) +
labs(
x = "President",
y = "President",
title = "TF-IDF Cosine Similarity of Presidential Speeches",
fill = "Similarity"
) +
coord_fixed()

Singh (JCU): Presidential Speeches